Connect to and read a Parquet file from S3 storage without the Apache Hadoop dependency in Java

Ask Time:2022-02-16T23:13:57         Author:Arul

I need to connect to S3 and read a Parquet file and its contents in Java. I have done this the Hadoop way, and it works:

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-aws</artifactId>
        <version>3.3.1</version>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.3.1</version>
    </dependency>

However, these dependencies pull in Log4j 1.2.17, which is vulnerable. Apache Hadoop 3.3.1, the latest version, was released in June 2021, before the Log4j vulnerability came to light.

Does anyone know a workaround? Can this requirement be met without the Hadoop dependency?

Here is my code that does the job

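    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.format.converter.ParquetMetadataConverter;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.ParquetMetadata;
    import org.apache.parquet.schema.MessageType;

    ParquetFileMetaData parquetFileMetaData = null;
    // parquetFilePath is expected to start with "/"
    String filePath = "s3a://" + bucket + parquetFilePath;
    Path path = new Path(filePath);
    try {
        // Read the footer, which carries the schema and the row-group metadata
        ParquetMetadata readFooter = ParquetFileReader.readFooter(config, path, ParquetMetadataConverter.NO_FILTER);
        MessageType schema = readFooter.getFileMetaData().getSchema();
        // Open a reader over the same file, reusing the footer that was just read
        ParquetFileReader parquetFileReader = new ParquetFileReader(config, path, readFooter);
        parquetFileMetaData = new ParquetFileMetaData();
        parquetFileMetaData.setSchema(schema);
        parquetFileMetaData.setParquetFileReader(parquetFileReader);
    } catch (IOException e) {
        e.printStackTrace();
    }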
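
For context on what the footer read involves: a Parquet file with a plaintext footer ends with the serialized footer metadata, then a 4-byte little-endian footer length, then the 4-byte magic `PAR1`. The sketch below uses only the JDK and parses those last 8 bytes to work out where the footer starts. `ParquetTail` and its methods are hypothetical names for illustration and are not part of parquet-mr; decoding the Thrift-encoded footer itself would still need a Parquet library.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper: locates the Parquet footer from the last 8 bytes of a file. */
public final class ParquetTail {
    /** Trailing magic of a Parquet file with a plaintext footer. */
    private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);
    /** 4 bytes of footer length followed by 4 bytes of magic. */
    public static final int TAIL_LENGTH = 8;

    private ParquetTail() {}

    /** Returns the footer length encoded in the final 8 bytes of the file. */
    public static int footerLength(byte[] tail) {
        if (tail.length != TAIL_LENGTH) {
            throw new IllegalArgumentException("expected exactly " + TAIL_LENGTH + " bytes");
        }
        byte[] magic = Arrays.copyOfRange(tail, 4, 8);
        if (!Arrays.equals(magic, MAGIC)) {
            throw new IllegalArgumentException("trailing magic is not PAR1");
        }
        // The footer length is stored as a little-endian 32-bit integer.
        int length = ByteBuffer.wrap(tail, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
        if (length < 0) {
            throw new IllegalArgumentException("negative footer length");
        }
        return length;
    }

    /** Returns the byte offset at which the footer metadata starts. */
    public static long footerOffset(long fileSize, int footerLength) {
        long offset = fileSize - TAIL_LENGTH - footerLength;
        // The file also begins with a 4-byte magic, so the footer cannot start before it ends.
        if (offset < MAGIC.length) {
            throw new IllegalArgumentException("footer length exceeds file size");
        }
        return offset;
    }
}
```

Given the object size and its last 8 bytes, `footerOffset(fileSize, footerLength(tail))` gives the start of the byte range that holds the footer.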

My storage is S3-compatible, not Amazon S3.
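
With an S3-compatible store, objects are often reachable over HTTP(S) at a path-style URL on a custom endpoint. As a minimal sketch using only the JDK's `java.net.http` (Java 11 or later), the following builds such a URL and a ranged GET for the trailing bytes of an object. The class name `S3TailRequest` and the endpoint, bucket, and key in the usage note are hypothetical. The sketch assumes path-style addressing, a key that is already URL-encoded, and an object that can be read without request signing, none of which the question confirms.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical helper: builds requests against an S3-compatible endpoint. */
public final class S3TailRequest {
    private S3TailRequest() {}

    /** Builds a path-style object URI: endpoint/bucket/key. The key must already be URL-encoded. */
    public static URI objectUri(String endpoint, String bucket, String key) {
        // Normalize slashes so that exactly one separates each part.
        String base = endpoint.endsWith("/") ? endpoint.substring(0, endpoint.length() - 1) : endpoint;
        String path = key.startsWith("/") ? key.substring(1) : key;
        return URI.create(base + "/" + bucket + "/" + path);
    }

    /** Builds a GET that asks for only the last {@code count} bytes via an HTTP suffix range. */
    public static HttpRequest lastBytes(URI uri, int count) {
        if (count <= 0) {
            throw new IllegalArgumentException("count must be positive");
        }
        return HttpRequest.newBuilder(uri)
                .header("Range", "bytes=-" + count)
                .GET()
                .build();
    }
}
```

For example, `S3TailRequest.lastBytes(S3TailRequest.objectUri("https://s3.example.internal", "my-bucket", "/data/file.parquet"), 8)` produces a request that can be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofByteArray())`. A server that honors the range would normally answer with status 206 and only the requested bytes.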

Author: Arul. Reproduced under the CC BY-SA 4.0 license with a link to the original source and this disclaimer.
Link to original article:https://stackoverflow.com/questions/71144411/connect-and-read-parquet-file-from-s3-storage-without-apache-hadoop-dependency-i